Figure 1: Function on four variables

1 Background on the Pattern Theory Approach

The Pattern Theory paradigm rests on two central ideas, both presented in this section. The first is that the
functions an investigator wishes to learn have low decomposed function cardinality. The second is that functions
with low decomposed function cardinality are learnable from a relatively small number of samples. In this section,
we present some background on function decomposition and on how Pattern Theory uses it as a robust way to find
patterns.

Decomposing a function involves breaking it up into smaller subfunctions. These smaller functions are further
broken down until no subfunction can be decomposed any further. For a given function, the number of ways to choose
two sets of variables (the partition space) is exponential. The decomposition space is even larger, since there
are several ways the subfunctions can be combined and several levels of subfunctions are possible. The
complexity measure that we use to determine the relative predictive power of different function decompositions
is called Decomposed Function Cardinality (DFC).

DFC is calculated by adding the cardinalities of each of the subfunctions in the decomposition. The cardinality
of an n-variable binary function is $2^n$. We illustrate the measure in Figures 1 and 2. In Figure 1, we have a
function on four variables with cardinality $2^4 = 16$. In Figure 2, we show the same function after it has been
decomposed. The DFC of this representation of the original function is $2^2 + 2^2 + 2^2 = 12$. The DFC measures
the relative complexity of a function. When we search through the possible decompositions of a function, we
choose one with the smallest DFC. This decomposition is our learned concept.
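The DFC arithmetic can be checked with a minimal Python sketch. The names `cardinality` and `dfc` are illustrative and are not part of FLASH; a decomposition is described here only by the number of inputs of each subfunction, and the example assumes that Figure 2 consists of three two-input subfunctions, which is consistent with the stated total of 12.

```python
def cardinality(n_inputs: int) -> int:
    """Cardinality of a binary function on n_inputs variables: 2**n."""
    return 2 ** n_inputs


def dfc(subfunction_inputs) -> int:
    """Sum the cardinalities of the subfunctions in a decomposition.

    subfunction_inputs: iterable with the input count of each subfunction
    (an assumed, simplified description of a decomposition).
    """
    return sum(cardinality(n) for n in subfunction_inputs)


print(cardinality(4))   # undecomposed four-variable function
print(dfc([2, 2, 2]))   # assumed decomposition: three two-input subfunctions
```

Under these assumptions the undecomposed table costs 16 and the decomposition costs 12, so the decomposition would be preferred.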

The decomposed representation of the function exhibits more information than the alternative. For
example, Figure 1 is essentially a lookup table of inputs and outputs. Figure 2, on the other hand, is a function
that is not simply a table. The decomposition could, for example, be two simple functions combined together.

Throughout the paper, when we refer to a minimal function decomposition, we use "minimal" to mean a
decomposition whose DFC is the smallest possible over the entire set of decompositions. Note that
a given minimal decomposition is not unique. For a more rigorous explanation of the inner workings of function
decomposition or function extrapolation, the reader is referred to [1], [2] and [8].

An important point is that a function with a low DFC has been experimentally and theoretically determined to
be learnable with a small number of samples [8]. Also, the functions we are interested in learning (i.e., functions
that are highly "patterned") have a low DFC. The Function Learning And Synthesis Hot-Bed (FLASH) was developed
to explore function decomposition and pattern finding. This paper will show that the FLASH program exhibits
promising results for finding patterns robustly.

System Concepts, Wright Laboratory, WL/AART-2, 2890 C Street STE 1, Wright-Patterson AFB, Ohio 45433-7408. Email:
goldinazuOaa.wpafb.afjtnil 
Figure 2: Decomposed function on four variables


2  The  C4.5  System 

C4.5 is a machine learning software package. A detailed study of it is given in [7]. The intention of this section
is to familiarize the reader with general information about how C4.5 learns a concept. It is important to note
that C4.5 is equipped to handle noisy data, conflicting data, continuous variables, and other features which are
not our primary concern in this discussion. Although Pattern Theory is interested in these issues, we are testing
performance given binary variables and 100% truthful data.

C4.5 is a decision-tree and rule-based system. This makes it a shallow reasoner; in other words, a deep
understanding of the world is not required. The advantages of a shallow reasoner are the separation of knowledge
and control, a natural mapping to rules, modular rules, and the ease of providing an explanation.
The disadvantages are the brittleness associated with an implicit domain model, a lack of common sense, a lack of
robustness, problems with formal verification, and often limited learning capabilities. A rule-based
system works best with diagnosis, configuration and control, and process control. Moreover, rule-based
systems are excellent for any system that has independent states, simple control flow (limited branching), and
the ability to state the knowledge needed without stating how it was obtained.

The C4.5 system has many different options that can be altered by the user for a given learning environment.
The default options are as follows. First, given a training set, C4.5 builds a decision tree using the gain ratio
criterion. In short, C4.5 chooses a test for the tree if it splits the training data into two unbalanced groups (i.e.,
Only Positive, Only Negative, Largest Positive). The measure is also normalized. In essence, the gain metric is
a measure of entropy. Second, C4.5 has a default threshold of 2 for a given test in a tree. The test must have at
least 2 outcomes with a minimum number of cases. To be more precise, the sum of the weights of the cases for
at least two subsets must attain a minimum of 2. We would increase this value if we had noisy data.
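To make the gain ratio criterion concrete, the following is a minimal Python sketch of the standard textbook definition (information gain divided by split information). It is not C4.5's own code: it ignores case weights, missing values, and the minimum-cases threshold, and the function names and the row/label layout are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting on attribute index `attr`.

    rows: list of tuples of discrete attribute values.
    labels: class label of each row.
    """
    n = len(rows)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    # Expected entropy after the split, weighted by subset size.
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information normalizes the gain by how finely the test splits.
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]          # the class equals the first attribute
print(gain_ratio(rows, labels, 0), gain_ratio(rows, labels, 1))
```

In this toy data set the first attribute separates the classes completely and the second carries no information, so a tree builder using this criterion would test the first attribute.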

Other flexibility built into C4.5 includes changing the amount of pruning of the decision tree (for more
generalization and better predictability with noisy data), allowing C4.5 to choose among n best trees, windowing,
debugging, use of continuous variables, using the older unnormalized gain, and various options for the rule
induction program. For our purposes, we will not be concerned with pruning since we are interested in C4.5's best
classification of the training data. We will, however, test different weight minimums for the gain metric, vary the
number of trees, test grouping, and change the minimum and maximum windowing sizes.

Windowing in C4.5 is a feature that is used when creating the initial tree in the test cases. The procedure is
to select a random set of training cases and build a tree. This tree is then used to classify the remaining
training cases. Any misclassified cases are used in a new refinement of the original tree. The cycle is repeated
until a tree is built that correctly classifies all of the training data. C4.5 allows you to alter the number of cases
to be included in the initial window. It also lets you specify a maximum number of cases that can be added to
the window at each iteration.

The grouping option for C4.5 allows the method to group discrete attributes by
value. Quinlan describes this procedure in detail for discrete variables with many values. The purpose
of grouping is to prevent forced binary splits. It uses an iterative merging technique on the training elements. We
were uncertain at the time of testing whether this grouping would have any relevance to our binary variable domain.
Therefore, to be thorough, it was better to try the method than to ignore it.
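The windowing cycle described above can be sketched in Python as follows. This is an illustration of the loop only, not C4.5's implementation: `window_train`, its parameters, and the toy `lookup_learner` are hypothetical names, and the sketch assumes the learner always classifies its own window correctly.

```python
import random


def window_train(cases, train, initial_size, max_add, seed=0):
    """Grow a training window until the model fits every training case.

    cases: list of (inputs, label) pairs.
    train: callable taking a list of cases and returning a classifier
           f(inputs) -> label (assumed consistent with its window).
    initial_size: number of cases in the first window.
    max_add: maximum number of misclassified cases added per iteration.
    """
    rng = random.Random(seed)
    pool = cases[:]
    rng.shuffle(pool)
    window, rest = pool[:initial_size], pool[initial_size:]
    while True:
        model = train(window)
        wrong = [c for c in rest if model(c[0]) != c[1]]
        if not wrong:
            return model
        added = wrong[:max_add]
        window = window + added
        rest = [c for c in rest if c not in added]


def lookup_learner(window):
    """Toy stand-in for a tree builder: memorize the window, default to 0."""
    table = dict(window)
    return lambda x: table.get(x, 0)


xor_cases = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]
model = window_train(xor_cases, lookup_learner, initial_size=1, max_add=1)
print([model(x) for x, _ in xor_cases])
```

Each pass adds at least one misclassified case to the window, so the loop ends once no case outside the window is misclassified.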

It is useful at this point to discuss the option that allows C4.5 to build several trees, retaining the best. The
reason C4.5 does not produce an optimal tree every time (optimal in the sense that it is the smallest possible
decision tree consistent with the training set) is that this problem is NP-complete [5]. Thus, the gain metric is
only a heuristic for building a near-optimal tree in polynomial time.

Some shortcomings of C4.5 are mentioned in [7]. One is that C4.5 cannot correctly classify cases in which
there are non-rectangular regions. For example, in dealing with continuous variables on a two-dimensional plane,
the line $y = x$ ($x, y > 0$) does not lend itself to building rectangular regions. Instead, the triangular regions are
approximated. Related problems are poorly delineated regions and fragmented regions. The author
attributes fragmented regions to not having enough data to classify correctly.

3  Description  of  Benchmark  Set 

Our benchmark set of functions will be used to compare the learning ability of C4.5 and Pattern Theory. This
set of functions, although not exhaustive, is designed to include many types of relationships that we might be
interested in. The overall goal of testing on several different functions is to compare robust learning ability in
this restrictive domain of binary variables. However, it is important to point out that although Pattern Theory
is not yet equipped to handle continuous variables, the underlying theory generalizes to discrete and continuous
variables. The reader is referred to [8] for a formal proof and further reading. This point is key since the binary
domain is so restrictive.

The benchmark set includes some 4 dozen functions. The categories break down into: Boolean Expressions,
String Functions, Images, Symmetric Functions, Numerical Functions, and Random Functions. All of the
functions are of the form $F : \{0,1\}^8 \to \{0,1\}$. In other words, there are eight binary variable inputs and one
binary variable output. A detailed description of each function is given here.

3.1  RANDOM 

There are 3 functions that were randomly generated by FLASH with seeds 1, 2, and 3. They are labeled:
rnd1, rnd2, and rnd3.

3.2  RANDOM  MINORITY  ELEMENTS 

There are 5 functions generated which have a fixed number of minority elements placed at random. The seed
for each was 1. They are labeled: rnd_m1, rnd_m5, rnd_m10, rnd_m25, rnd_m50.

3.3  BOOLEAN  EXPRESSION 

These  10  KDD  functions  were  designed  to  represent  concepts  in  a  database.  KDD  stands  for  Knowledge 
Discovery  in  Databases.  They  were  first  used  in  [3]  and  later  in  [4]. 

KDD1 = (x1x3) + x2
KDD2 = (x1x3x3)(x4 + xs)
KDD3 = (x1 + x2) + (x1xsxe)
KDD4 = x4
KDD5 = (x1x3x4) + (x3x5x7x8) + (zixjxsxsxs) + (*3x5)
KDD6 = x] + x4 + xs + xs
KDD7 = (x1x3) + (x3x4) + (x5x5) + (xrxs)
KDD8 = (x1x)) XOR (x1xs)
KDD9 = (x3 XOR x4)(xx XOR (xsztzs))
KDD10 = (x1 => x4) XOR (x7xs(x3 + x3))

Multiplexer, mux8: the mux6 multiplexer used in Koza, a 2-address-bit, 4-data-bit multiplexer, with two vacuous
variables (xq and xi) added to make 8 inputs. Generated mux6 with FLASH and then edited to make mux8, 3/29/94.

"Deep functions" generated by Mike Noviskey, 4/26/94: and_or_chain8 (removed or_and_chain8 because
or_and_chain8(x) = not(and_or_chain8(not(x))), as in De Morgan's theorem).
3.4 VARIATION ON THE MONK PROBLEMS

These are 8-binary-variable approximations to the Monk's problems of [9].
x1 : head shape (round, octagonal)
x2 : body shape (round, octagonal)
x3 : smiling (yes, no)
x4, x5 : holding (sword, balloon, flag, M16)
x6, x7 : jacket color (red, yellow, green, blue)
x8 : has tie (yes, no)

monkish1: head shape equals body shape or jacket is red.
monkish2: exactly 2 of 6 attributes have 1st value.
monkish3: (jacket green and has sword) or (jacket not blue and body not octagonal). Generated with FLASH, 4/6/94.

3.5  STRING  FUNCTIONS 

These functions are operators on 8-bit binary strings.
palindrome acceptor: pal, from FLASH 2/18/94.
palindrome output: pal_output, from Mike Noviskey and PVWave; randomly generated 128 bits, then mirror
imaged them to create the outputs of an 8-variable function, 3/25/94.
doubly palindromed output: pal_dbl_output, from Mike as above, except he generated 64 bits and flipped them
twice, 3/25/94.
2 interval acceptors from FLASH 2/18/94:
interval1 accepts strings with 3 or fewer intervals
interval2 accepts strings with 4 or fewer intervals
2 sub-string detectors from FLASH 2/18/94:
substr1 accepts strings with the sub-string "101"
substr2 accepts strings with the sub-string "1100"
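Three of the string functions are defined precisely enough to sketch in Python. The sketch below is an illustration, not the FLASH generator: it assumes the 8 input bits are read as a Python string in variable order, and the interval acceptors are omitted because the text does not define "interval".

```python
from itertools import product


def pal(bits: str) -> int:
    """Palindrome acceptor: 1 if the bit string reads the same reversed."""
    return int(bits == bits[::-1])


def substr1(bits: str) -> int:
    """Sub-string detector for "101"."""
    return int("101" in bits)


def substr2(bits: str) -> int:
    """Sub-string detector for "1100"."""
    return int("1100" in bits)


# All 256 eight-bit input strings, e.g. "00000000" ... "11111111".
ALL_STRINGS = ["".join(p) for p in product("01", repeat=8)]

print(sum(pal(b) for b in ALL_STRINGS))   # number of accepted palindromes
print(substr1("00101000"), substr2("00110011"))
```

An 8-bit palindrome is fixed by its first four bits, so under this encoding the acceptor is a function with a small minority class of accepted strings.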

3.6  IMAGES 

These functions are various bit maps. chXfY means character X from font Y of the Borland font set. All were
generated with the Pascal program char&.exe of 2/28/94.
ch8f0 - kind of a flat plus sign
ch15f0 - an Aztec-looking design
ch22f0 - horizontal bar
ch30f0 - solid isosceles triangle
ch47f0 - slash
ch176f0 - every other column of a checker board
ch177f0 - checker board
ch74f1 - triplex J
ch83f2 - small S (thin strokes)
ch70f3 - sans serif F
ch52f4 - Gothic 4

3.7  SYMMETRIC  FUNCTIONS 

These functions are symmetric, meaning that rearranging the order of the inputs does not affect the output.
parity, from FLASH 2/22/94.
contains_4_ones (f(x) = 1 if and only if the string x has 4 ones), from FLASH 3/2/94.
majority_gate (f(x) = 1 if and only if x has more 1's than 0's), from FLASH 3/2/94.
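The symmetric functions depend only on the number of ones in the input, which the following Python sketch makes explicit. Two conventions here are assumptions rather than statements from the text: parity is taken to output 1 for an odd number of ones, and contains_4_ones is taken to mean exactly four ones.

```python
from itertools import permutations, product


def parity(bits) -> int:
    """1 if the number of ones is odd (assumed convention)."""
    return sum(bits) % 2


def contains_4_ones(bits) -> int:
    """1 if the input has exactly four ones (assumed reading)."""
    return int(sum(bits) == 4)


def majority_gate(bits) -> int:
    """1 if the input has more ones than zeros."""
    ones = sum(bits)
    return int(ones > len(bits) - ones)


ALL_INPUTS = list(product((0, 1), repeat=8))

# Symmetry check on one input: every reordering gives the same output.
x = (1, 0, 1, 1, 0, 0, 0, 1)
for f in (parity, contains_4_ones, majority_gate):
    assert all(f(p) == f(x) for p in permutations(x))

print(sum(map(contains_4_ones, ALL_INPUTS)), sum(map(majority_gate, ALL_INPUTS)))
```

With 8 inputs a tie of four ones and four zeros is possible, and under the stated definition the majority gate outputs 0 in that case.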
3.8 NUMERICAL FUNCTIONS


These  functions  are  various  arithmetic  operators. 

addition: add0, add2, add4 - output bits of a 4-bit adder, 0 is the most significant bit, generated with FLASH
2/22/94.

greater_than: f(x1, x2) = 1 if and only if x1 > x2, generated with FLASH 3/2/94.

subtraction: subtraction1, subtraction3 - output bits 1 and 3 of the absolute value of a 4-bit difference, 0 is
the most significant bit, generated with FLASH 3/2/94.

modulus2: output bit 2 of a 4-bit modulus, 0 is the most significant bit, generated with FLASH 2/22/94.
remainder2: output bit 2 of a 4-bit remainder, 0 is the most significant bit, generated with FLASH 2/22/94.
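As an illustration of how such arithmetic functions map 8 input bits to one output bit, here is a Python sketch of greater_than and of the adder output bits. The operand layout is an assumption: the first four inputs are taken as one 4-bit operand and the last four as the other, most significant bit first, and the adder is assumed to produce a 5-bit sum so that output bits are numbered 0 (most significant) to 4.

```python
def to_int(bits) -> int:
    """Interpret a sequence of bits as an unsigned integer, MSB first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def greater_than(bits) -> int:
    """1 if the first 4-bit operand exceeds the second (assumed layout)."""
    return int(to_int(bits[:4]) > to_int(bits[4:]))


def add_bit(bits, k: int) -> int:
    """Output bit k of the 5-bit sum of the two operands, k = 0 is the MSB."""
    total = to_int(bits[:4]) + to_int(bits[4:])   # at most 30, fits in 5 bits
    return (total >> (4 - k)) & 1


print(greater_than((1, 0, 0, 0, 0, 1, 1, 1)))   # operands 8 and 7
print([add_bit((1, 1, 1, 1, 1, 1, 1, 1), k) for k in range(5)])   # 15 + 15
```

Under these assumptions add0 would be the carry-out of the adder and add4 the least significant sum bit.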


4  Experimental  Design 

The  overall  design  of  our  experiment  is  as  follows.  First,  several  options  were  tested  on  all  the  benchmark 
functions  in  order  to  determine  what  parameters  yielded  the  best  performance  for  C4.5.  Next,  the  resulting 
learning  curves  were  compared  with  Pattern  Theory. 

The tests on the individual functions were as follows. First, each method was given a random set of data to
train on, ranging from 25 to 250 out of a total of 256 possible cases. Once the method was trained, the entire 256
cases were tested and the number of differences was recorded as errors. This procedure was repeated 10 times
for each training sample size, in intervals of 25, yielding a maximum, minimum, and average number of errors
for each. Thus, the total number of runs for each function was 100, of varying sample size. None of the learning
was incremental. All of the runs were independent.
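The test procedure can be summarized as a short Python sketch. It is a schematic of the protocol only: `learning_curve`, `target`, and `learn` are hypothetical names, the toy `lookup_learner` merely stands in for C4.5 or FLASH, and the random seed handling is an assumption.

```python
import random
from itertools import product
from statistics import mean


def learning_curve(target, learn, sizes=range(25, 251, 25), repeats=10, seed=0):
    """Return {sample size: (min, average, max) errors over the repeats}.

    target: the benchmark function, mapping an 8-bit tuple to 0 or 1.
    learn: callable taking a list of (inputs, label) pairs and returning
           a classifier f(inputs) -> label.
    """
    rng = random.Random(seed)
    cases = list(product((0, 1), repeat=8))       # all 256 possible cases
    curve = {}
    for n in sizes:
        errors = []
        for _ in range(repeats):                  # independent runs
            sample = rng.sample(cases, n)
            model = learn([(x, target(x)) for x in sample])
            # Test on all 256 cases and count the disagreements.
            errors.append(sum(model(x) != target(x) for x in cases))
        curve[n] = (min(errors), mean(errors), max(errors))
    return curve


def lookup_learner(training):
    """Toy learner: memorize the training cases, answer 0 elsewhere."""
    table = dict(training)
    return lambda x: table.get(x, 0)


curve = learning_curve(lambda x: sum(x) % 2, lookup_learner)
print(curve[25], curve[250])
```

With 10 sample sizes and 10 repeats each, the sketch performs the 100 independent runs per function described above.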

Our first task was to find the best options to maximize C4.5's performance over the entire training set. The
options that were varied on C4.5 include the weight (threshold value for branching), initial windowing size,
maximum window size, grouping, and the number of trees grown. The results are displayed in Tables 1-6. The
rows are for each function tested. The columns are a description of the options given. The value in the table is
the average number of errors for a given run, for a given function over the entire sampling of 25 to 250 samples
(the average of all 100 points). The value at the bottom is the average number of errors for a given set of options
over the entire set of functions. The smaller the number here, the better the overall performance.

The data is divided into six separate tables. Table 1 shows the relative performance of C4.5 with all of the
default options, varying the number of trees. The first column is one tree (the default), the second column is 10
trees, and the third column is 100 trees. Table 2 shows the relative performance of C4.5 with the default options
except that the windowing size (-w) is set to 0. Again, the three columns vary the number of trees from 1 to 10
to 100. Table 3 has all of the default options except the threshold parameter (-m) is set to 0. Once again, the
number of trees is 1, 10, and 100 respectively. Table 4, like the others, has all the default options except the
threshold is set to 0 and the window size is set to 0.

Table  5  is  slightly  different.  The  first  column  tests  all  of  the  default  options  with  the  threshold  set  to  -1.  This 
was  compared  with  the  identical  run  where  the  threshold  was  set  to  0  (column  1  of  Table  3)  to  ensure  that  0  was 
indeed  the  lowest  possible  setting  for  the  threshold  parameter.  The  second  column  has  the  default  options  except 
the  threshold  is  set  to  0  and  the  maximum  number  of  allowable  cases  is  256  (-i  256).  This  means  essentially  that 
since  this  is  only  an  eight  variable  function,  we  have  no  imposed  maximum  number  of  cases.  Column  three  is  the 
default  options  except  the  threshold  is  set  to  1  and  the  number  of  trees  built  is  10. 

Finally, Table 6 shows some final experiments using grouping (-s) in addition to our best options given so far.
Column 1 is the default options except the threshold is 0 and the number of trees built is ten. Column 2 is the
same except the threshold value is 1.

If we examine all the tables in detail, we can conclude that setting the threshold value to 0 (smallest
possible) or 1 significantly decreases the number of errors compared with the default. There
does not appear to be any significant difference between a threshold of 0 and 1. Changing the window
size parameters produces no significant change. In fact, once we use 10 or more trees, the initial window size
parameter does not make any difference. The grouping option does not appear to give any advantage either for
this data set. As far as the best number of trees, clearly more is better. However, there doesn't appear to be
Function Name      C4.5 Default    C4.5 10 trees    C4.5 100 trees
add0               22.56           22.62            23.12
ch70f3             15.51           15.08            14.72
ch74f1             20.91           20.12            19.66
ch83f2             33.42           32.24            32.42
subtract1          64.14           53.02            52.94

Average over all functions: 30.87770833, 30.34104167

Table 1: C4.5 trials with default options, varying the number of trees
Table 2: C4.5 trials with window size 0, varying the number of trees


                     C4.5 1 tree      C4.5 10 trees    C4.5 100 trees
Function Name        Threshold = 0    Threshold = 0    Threshold = 0
ch15f0               33.25            27.98            25.88
ch176f0              13.49            5.54             5.54
ch22f0               22.87            11.63            9.96
ch30f0               11.87            11.54            12.18
ch47f0               25.79            21.52            20.92
ch52f4               23.55            22.63            22.12
ch70f3               12.17            12.18            12.06
ch74f1               16.71            15.79            15.8
ch83f2               27.77            26.03            25.94
ch8f0                14.56            11.93            11.6
contains_4_ones      59.1             58.49            58.68
kdd1                 0.32             0.32             0.32
kdd10                17.73            17.52            17.96
kdd8                 10.99            6.35             6.03
kdd9                 21.54            13.79            13.63
majority_gate        34.86            36.24            36.54
modulus2             12.15            12.26            12.29
mux8                 19.29            13.96            11.44

Average over
all functions        26.406875        23.22375         23.00625

Table 3: C4.5 trials with threshold 0, varying the number of trees
                     C4.5 1 tree       C4.5 10 trees     C4.5 100 trees
Function Name        Thresh=Wind.=0    Thresh=Wind.=0    Thresh=Wind.=0
add0                 15.92             16.26             16.58
add2                 36.4              27.4              26.32
add4                 10                3.6               3.6
ch15f0               30.8              27.98             25.88
ch176f0              6.38              5.54              5.54

Average over all functions: 23.22375, 23.00625

Table 4: C4.5 trials with window and threshold size 0, varying the number of trees
Function Name      Threshold = -1    Thresh=0, -i 256    Threshold = 1
kdd10              17.73             17.29               17.4
kdd2               2.24              2.76                2.76
rnd1               60.45             61.38               61.52
rnd2               61.52             61.2                62.22
rnd3               60.84             60.94               60.5
substr1            31.63             29.14               30.03

Average over all functions: 23.37833333, 23.2052083

Table 5: C4.5 trials with varying options
                   C4.5 10 trees
Function Name      Threshold=0/grouping    Threshold=1/grouping
interval1          34.28                   34.1
interval2          44.35                   44.2
kdd1               0.32                    0.32
kdd10              17.52                   17.4
rnd_m10            11.62                   11.62
rnd_m25            26.59                   25.59
rnd_m5             6.99                    6.99
rnd_m50            42.72                   42.72
rnd1               61.5                    61.52
rnd2               62.26                   62.22
rnd3               60.48                   60.5
substr1            30.3                    30.03
substr2            22.08                   22.33

Average over all functions: 23.20520833

Table 6: C4.5 trials with grouping options


any significant benefit going from 10 trees to 100. In fact, we have no reason to believe that there will be any
significant difference between 10 trees and 1000. Thus the best set of options, which we will test against Pattern
Theory, is all the options at their defaults except that the weight (threshold) is 0 (-m 0) and the number of trees
is 10 (-t 10). We will use -m 0 over -m 1 because this is the lowest setting we can have, which corresponds to
not having any noise in the data.


5  Experimental  Results 

Now that we have the best options possible for C4.5 over our benchmark set of functions, we can test it
fairly against Pattern Theory. This section refers to the learning curves for every function tested. The curves
themselves are shown in Appendix A. The sets are displayed with C4.5 first on a given function, then Pattern
Theory. For a given graph, the Y-axis is the number of errors and the X-axis is the number of training samples.
Each graph includes the maximum, minimum, and average error. The experiments stopped if the maximum error
reached 0 (i.e., for all 10 runs, there were no errors). Thus, there would not be any data points beyond that
particular sample size. The chance line, represented by dashes, is the error expected if we were to randomly
guess on the remaining cases. We would expect to get half of them right and half wrong since there are only two
outcomes. On the graphs for FLASH (Pattern Theory), there are additional points plotted corresponding to the
calculated DFC and the number of "don't cares" for a given sampling. We can also see that functions that are
highly patterned have a low DFC while more complicated patterns have a higher DFC. Moreover, the random
functions have a very high DFC.
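The chance line follows directly from the description above and can be written as a one-line Python function. The name `chance_errors` is illustrative, and the sketch assumes that the training cases themselves are classified correctly, so that only the unseen cases are guessed.

```python
def chance_errors(n_train: int, n_total: int = 256) -> float:
    """Expected errors when guessing at random on the unseen cases.

    With two equally likely outcomes, half of the n_total - n_train
    remaining cases are expected to be wrong.
    """
    return (n_total - n_train) / 2


for n in (25, 125, 250):
    print(n, chance_errors(n))
```

A learner whose average error sits near this line at a given sample size is doing no better on unseen cases than random guessing.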

Earlier, we gave some background about how function decomposition works and thus how FLASH works. What
has not been described is the actual search procedure that FLASH uses in order to select a partition. First, the
same strategy was used for every experiment. Essentially, it is a two-ply look-ahead on all possible partitions.
The calculated DFC is used to continue selecting partitions until they no longer decompose. The actual strategy
itself is given in Appendix B. The name of the decomposition plan is dniOeSOO.


6  Analysis 

If we examine C4.5's performance as a whole, its ability ranges from extremely good to extremely poor.
C4.5's performance was excellent for the Boolean Expression functions; it had a respectable performance for
PAL, MUX8, MODULUS2, and GREATER_THAN; it had a hard time with the other string functions, and it was
poor at learning the other PAL functions. C4.5 is especially poor at PARITY. Its performance on the character
functions is mixed: some it learns very well, and on others its performance is fair. A few anomalies were the
fact that C4.5 performed very well on ADD4 and ADD0 but poorly on ADD2. It was even more bizarre to see
excellent performance on SUBTRACTION3 and very poor performance on SUBTRACTION1.

Comparing C4.5 and FLASH (Pattern Theory/Function Decomposition), C4.5 beats FLASH for only two
functions: KDD2 and KDD3. It is equal or slightly better for the two character functions CH52F4 and CH83F2. For
all of the other functions, however, FLASH outperforms C4.5. In some cases, the performance margin is
substantial. The notable cases are: ADD2, ADD4, CONTAINS_4_ONES, KDD7, KDD9, KDD10, MAJORITY_GATE,
PARITY, SUBTRACTION1, and SUBTRACTION3. Of course, we are not concerned with comparing
performance on different random functions. Their purpose is to measure consistency and normal behavior. It would be
unusual for any method to be significantly better in some random function than another.

The  other  functions  were  not  mentioned  here  because  some  comparisons  might  be  construed  as  unobjective. 
Although  different  performance  is  seen  in  some  cases,  in  general,  we  see  equal  performance  or  FLASH  performing 
better.  The  above  functions  were  mentioned  specifically  because  of  the  vast  differences  between  the  two  programs. 

In general, what we see in C4.5 is that it is unequipped to handle “XOR” type relations. The inherent problem
is its inability to deal with replication in such disjunctive concepts as (A and B) or (C and D). This is as
expected [6]. It would appear that C4.5 is unable to effectively learn functions that have an “XOR” or lend
themselves to “XOR.” FLASH, on the other hand, has no restrictions in this area. There are still some problems
with functions that have deep replication that prevent FLASH from completely learning such a function unless
all of the samples are given. However, its performance does not fall below that of C4.5.
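
One standard way to see why a greedy, one-variable-at-a-time tree learner struggles with “XOR” relations is to compute the information gain of each single variable. The sketch below is an illustration only, not the analysis of [6] and not C4.5’s actual code: the function names are hypothetical, and the gain is computed over the full truth table rather than over a training sample. For parity, every single-variable split has zero gain, while for the disjunctive concept (A and B) or (C and D) each variable carries some signal.

```python
from itertools import product
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(f, n_vars, var):
    """Gain from splitting the full truth table of f on one variable."""
    rows = list(product((0, 1), repeat=n_vars))
    gain = entropy([f(r) for r in rows])
    for value in (0, 1):
        subset = [f(r) for r in rows if r[var] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def parity(bits):
    """1 if an odd number of inputs are 1 (an n-way XOR)."""
    return sum(bits) % 2

def dnf(bits):
    """The disjunctive concept (A and B) or (C and D)."""
    a, b, c, d = bits
    return int((a and b) or (c and d))

# Single-variable gains: zero everywhere for parity, positive for the DNF concept.
parity_gains = [information_gain(parity, 8, v) for v in range(8)]
dnf_gains = [information_gain(dnf, 4, v) for v in range(4)]
print("parity:", parity_gains)
print("dnf:   ", dnf_gains)
```

With no variable looking better than any other, a gain-driven split on parity is essentially arbitrary, which is consistent with the poor PARITY results reported above.
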

C4.5 holds its own for the Boolean functions that do not involve “XOR.” It also had a respectable performance
for some of the character functions. However, the best domain for C4.5 is the class of Boolean Expressions. In
the other areas, it does not stand up to FLASH.

In Table 7, we list the average errors for all the functions, similar to the previous section. Here, however, we
show C4.5’s best options alongside FLASH’s best. One can see from the total average error that FLASH outperforms
C4.5 as a robust pattern finder in this domain of binary variables and noise-free data. The table also shows, off
to the side, how the average error changes after successively removing functions that C4.5 is unable to handle.
Once they are all removed, the respective performances are nearly equal.

Although Table 7 provides a nice compact comparison, the reader should note that it is not 100% reliable.
There are a few functions for which the average error does not correspond to performance. These anomalies
are few in number, notably KDD2, SUBTRACTION3, and ADD4. For example, in KDD2, C4.5’s
average error is 2.76 versus FLASH’s 2.4. However, C4.5 learns the function in 125 samples and FLASH learns
the function in 150 samples. On the other hand, in SUBTRACTION3, C4.5’s average is 3.6 versus FLASH’s 0. The
two performances appear similar. But C4.5 learns the function in 150 samples and FLASH learns the function
in only 25 samples. For a comprehensive analysis, the reader is again referred to the individual learning curves
in Appendix A.
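
The gap between the two summaries can be made concrete with a small sketch. The two learning curves below are HYPOTHETICAL (they are not data from this report), and the definition of “samples to learn” as the smallest sample size from which the average error stays at zero is our assumption about how that figure is read off a curve.

```python
def average_error(curve):
    """Mean of the average-error values over all training sample sizes."""
    return sum(curve.values()) / len(curve)

def samples_to_learn(curve):
    """Smallest sample size from which the error is zero and stays zero.

    Returns None if the curve never settles at zero (assumed definition).
    """
    learned_at = None
    for size in sorted(curve, reverse=True):
        if curve[size] != 0:
            break
        learned_at = size
    return learned_at

# HYPOTHETICAL curves: training sample size -> average number of errors.
sizes = range(25, 251, 25)
curve_a = dict(zip(sizes, [20, 6, 2, 1, 1, 0, 0, 0, 0, 0]))
curve_b = dict(zip(sizes, [31, 0, 0, 0, 0, 0, 0, 0, 0, 0]))

for name, curve in (("A", curve_a), ("B", curve_b)):
    print(name, average_error(curve), samples_to_learn(curve))
```

Here curve B has the slightly higher average error yet reaches zero error far earlier, because one large early value dominates its average. This is the kind of mismatch that makes the individual curves worth consulting.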

Another attempt to summarize all of the data from the graphs in Appendix A is shown in Figure 3.
Here, we show the number of functions learned versus the number of samples for FLASH and C4.5. From the
figure, we can see a clear performance distinction between the two methods.


7  Conclusion  and  Summary 

In conclusion, FLASH (Pattern Theory) was shown to be a more robust pattern finder than C4.5 for our
limited domain of binary variables and noise-free data. Again, we emphasize that Pattern Theory can
be extended to discrete and continuous valued variables, demonstrating its flexibility. C4.5 held its own in the
Boolean Expression domain and on some of the images; however, its performance was lacking in comparison in
the other domains. Specifically, C4.5 fails to learn concepts with implicit “XOR” representations or functions that
have duplication in their subtrees.

Pattern Theory has been demonstrated to be a robust, effective inductive learning technique comparable to the best.
The experimental results show its learning ability relative to chance and C4.5. Furthermore, as displayed by our
graphs, there is a correlation between a function that is highly “patterned” and a function that has a low DFC.
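
As a reminder of how the DFC measure is computed, the following minimal sketch sums the cardinalities of the subfunctions in a decomposition, using the definition from Section 1 (an n-variable binary function has cardinality 2^n). The function names are ours, and a decomposition is represented simply as a list of subfunction input counts, which is a simplification.

```python
def cardinality(n_inputs):
    """Cardinality of a binary function on n_inputs variables."""
    return 2 ** n_inputs

def dfc(subfunction_inputs):
    """Decomposed Function Cardinality: sum of the subfunction cardinalities."""
    return sum(cardinality(n) for n in subfunction_inputs)

# The four-variable example from Section 1, before and after decomposition
# into three two-variable subfunctions.
undecomposed = dfc([4])
decomposed = dfc([2, 2, 2])
print(undecomposed, decomposed)
```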


8  Future  Work 

Some future directions in this area are to continue testing more functions like those in our experiments. In
fact, a few of the functions tested were added after most of the experiments were performed (the MONK’s problems
and the “deep functions”) and were not included in the first 7 tables. New functions are constantly being tested,
but we had to wrap up the discussion at some point. Their graphs are, however, included in Appendix A for
study. In addition, we plan to increase the number of variables to as many as 30 in the immediate future.
We are also looking into adapting the current program to handle discrete and continuous variables. Furthermore,
we ultimately plan to incorporate methods of handling noise. Moreover, we are looking for ways to increase the
speed by limiting the exploration of the partition search space.

We have a working theoretical result for applying function decomposition to continuous variables, and the
searching ability is improving. At this point, the real limitations are the number of variables and noise. Since
function decomposition involves an exponential search space, the only hope is using some method to prune the
branches of the tree. At the rate the work is going, it is very possible that at the time of this printing, we will be
able to handle up to 100 variables with the same accuracy.

Noise,  on  the  other  hand,  is  a  more  difficult  problem.  At  present,  we  have  no  formal  theoretical  basis  for 
dealing  with  it.  It  is,  however,  a  personal  interest  of  the  author  and  the  hope  is  to  perform  some  quality  research 
in  this  area. 

Table 7: C4.5’s best options with FLASH’s best

Figure 3: Generalization Comparison (number of functions learned versus number of samples, for FLASH and C4.5)


References 


[1] Robert L. Ashenhurst. The decomposition of switching functions. In Proceedings of the International
Symposium on the Theory of Switching, April 1957.

[2] Mark L. Axtell, Timothy D. Ross, and Michael J. Noviskey. Pattern theory in algorithm design. In NAECON
Proceedings. IEEE and AIAA, May 1993.

[3] Jeffrey A. Goldman. Pattern theoretic knowledge discovery. Technical Report WL-TR-94-1093, Wright
Laboratory, USAF, WL/AART, WPAFB, OH 45433-6543, August 1994. Work in progress.

[4] Jeffrey A. Goldman. Pattern theoretic knowledge discovery. In 6th IEEE International Conference on Tools
with Artificial Intelligence. IEEE, November 1994.

[5] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-Complete. Information Processing
Letters, 5(1):15-17, 1976.

[6] Giulia Pagallo and David Haussler. Boolean feature discovery in empirical learning. Machine Learning,
5:71-99, 1990.

[7] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, Palo Alto, California, 1993.

[8] Timothy D. Ross, Michael J. Noviskey, Timothy N. Taylor, and David A. Gadd. Pattern theory: An
engineering paradigm for algorithm design. Final Technical Report WL-TR-91-1060, Wright Laboratory, USAF,
WL/AART, WPAFB, OH 45433-6543, August 1991.

[9] S. B. Thrun et al. The MONK’s problems - a performance comparison of different learning algorithms.
Technical report, Carnegie Mellon University, December 1991.




A  Individual  Learning  Curves  of  Each  Function  for  C4.5  and  FLASH 


This section contains the set of graphs described in the report. Every graph has the name of the function being
tested at the top, the number of errors on the y-axis, and the number of samples on the x-axis.

The tests on the individual functions were as follows. First, each method was given a random set of data to
train on, ranging from 25 to 250 out of a total of 256 possible cases. Once the method was trained, the entire 256
cases were tested and the number of differences was recorded as errors. This procedure was repeated 10 times
for a given training sample size, in intervals of 25, yielding a maximum, minimum, and average number of errors
for each. Thus, the total number of runs for each function was 100, of varying sample size. None of the learning
was incremental. All of the runs were independent.
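
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness actually used: `memorize_learner` is a hypothetical stand-in for C4.5 or FLASH (it recalls the training cases and answers 0 elsewhere), parity is used only as an example target, and the fixed seed is ours.

```python
import random
from itertools import product

N_VARS = 8
CASES = list(product((0, 1), repeat=N_VARS))  # all 256 possible input cases

def memorize_learner(training):
    """HYPOTHETICAL stand-in learner: recall seen cases, answer 0 otherwise."""
    table = dict(training)
    return lambda case: table.get(case, 0)

def learning_curve(target, learner, trials=10, seed=0):
    """Return {sample_size: (min, max, average) errors} over independent trials."""
    rng = random.Random(seed)
    curve = {}
    for size in range(25, 251, 25):              # sample sizes 25, 50, ..., 250
        errors = []
        for _ in range(trials):                  # each run is independent
            sample = rng.sample(CASES, size)     # random training set, no repeats
            hypothesis = learner([(c, target(c)) for c in sample])
            # Test on the entire 256 cases and count the differences.
            errors.append(sum(hypothesis(c) != target(c) for c in CASES))
        curve[size] = (min(errors), max(errors), sum(errors) / trials)
    return curve

def parity(case):
    """Example target function: 1 if an odd number of inputs are 1."""
    return sum(case) % 2

curve = learning_curve(parity, memorize_learner)
for size, (lo, hi, avg) in curve.items():
    print(size, lo, hi, avg)
```

Ten sample sizes with ten trials each gives the 100 runs per function mentioned above.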

The chance line was calculated as follows. For a given sample size, assume we simply mimic the data we
are given and then randomly guess the remaining unknown elements. This creates a static learning line which
represents learning by chance. For a given function, the graph for C4.5 is displayed first, followed by FLASH
(function decomposition). The FLASH graphs also have some additional information plotted: the average DFC
and the number of “don’t cares” or unknowns. The DFC is calculated from the function realized for each sample
size. Since there are ten trials at each sample size, an average DFC is computed.
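
A minimal sketch of the chance line follows. It assumes that each unknown case is guessed independently with a fair coin, so that half of the unseen cases are expected to be wrong; the report does not state the guessing distribution, so this is our reading.

```python
TOTAL_CASES = 256  # 2**8 possible input cases

def chance_errors(sample_size, total_cases=TOTAL_CASES):
    """Expected errors when seen cases are mimicked and the rest are coin flips."""
    unseen = total_cases - sample_size
    return unseen / 2  # assumed: each guess is wrong with probability 1/2

# Chance line at the sample sizes used in the experiments.
chance_line = {n: chance_errors(n) for n in range(25, 251, 25)}
for n, expected in chance_line.items():
    print(n, expected)
```

Under this assumption the line falls linearly with the number of samples and does not depend on the function being learned.
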


B  Listing  of  the  Decomposition  Plan 

Appendix  B  shows  the  actual  decomposition  plan  used  in  our  tests  with  FLASH.  The  listing  is  a  printout  of 
the  file  dniOeSOO. 